Abstract

Our goal is to develop a state-of-the-art predictor with an intuitive and biophysically-motivated
energy model through the use of Hidden Markov Support Vector Machines (HM-SVMs), a recent
innovation in the field of machine learning. We focus on the prediction of alpha helices in proteins
and show that using HM-SVMs, a simple 7-state HMM with 302 parameters can achieve a Qα value
of 77.6% and a SOVα value of 73.4%. We briefly describe how our method can be generalized to
predicting beta strands and sheets.



1 Introduction 

It remains an important and relevant problem to accurately predict the secondary structure of proteins 
based on their amino acid sequence. The identification of basic secondary structure elements (alpha
helices, beta strands, and coils) is a critical prerequisite for many tertiary structure predictors, which
consider the complete three-dimensional protein structure [6| [TTl [22] . To date, there has been a broad 
array of approaches to secondary structure prediction, including statistical techniques [TOl [HI [IS] , neural 
networks [a[l5l|20l[25l[22l[28l[2Hl[32], Hidden Markov Models [3l[71[l8l[2ll[23l[35l|36ll371|40l[4l], Support 
Vector Machines [H [5l [131 fT2l [Ml [39] , nearest neighbor methods |34] and energy minimization [T7] . In 
terms of prediction accuracy, neural networks are among the most popular methods in use today [9l [31] , 
delivering a pointwise prediction accuracy (Q3) of about 77% and a segment overlap measure (SOV) of 
about 74% [15].

However, to improve the long-term performance of secondary structure prediction, it likely will be 
necessary to develop a cost model that mirrors the underlying biological constraints. While neural 
networks offer good performance today, their operation is largely opaque. Often containing upwards of 
10,000 parameters and relying on complex layers of non-linear perceptrons, neural networks offer little 
insight into the patterns learned. Moreover, they mask the shortcomings of the underlying models, 
rendering it a tedious and ad-hoc process to improve them. In fact, over the past 15 years, the largest 
improvements in neural network prediction accuracy have been due to the integration of homologous 
sequence alignments [32] [15] rather than specific changes to the underlying cost model.

Of the approaches developed to date, Hidden Markov Models (HMMs) offer perhaps the most natural
representation of protein secondary structure. An HMM consists of a finite set of states with learned
transition probabilities between states. In biological terms, each transition corresponds to a local folding
event, with the most likely sequence of states corresponding to the lowest-energy protein structure. HMMs
generally contain hundreds of parameters, two orders of magnitude fewer than neural networks. In
addition to providing a tractable model that can be reasoned about, the reduction in parameters lessens
the risk of overlearning. However, the leading HMM methods to date [3l |3Q| have not exceeded a Q3
value of 75%, and SOV scores are often unreported.

In this paper, we focus on improving the prediction accuracy of HMM-based methods, thereby
advancing the goal of achieving a state-of-the-art predictor while maintaining an intuitive and
biophysically-motivated cost model. Our technique relies on Hidden Markov SVMs (HM-SVMs), a recent
innovation in the field of machine learning [1]. While HM-SVMs share the prediction structure of HMMs,
the learning algorithm is more powerful. Unlike the expectation-maximization algorithms typically used to
train HMMs, training with an SVM allows for a discriminative learning function, a soft margin criterion, and
bi-directional influence of features on parameters [1].

Using the HM-SVM approach, we develop a simple 7-state HMM for predicting alpha helices and
coils. The HMM contains 302 parameters, representing the energetic benefit for each residue being in
the middle of a helix or being in a specific position relative to the N- or C-cap. Our technique does
not depend on any homologous sequence alignments. Applied to a database of all-alpha proteins, our
predictor achieves a Qα value of 77.6% and an SOVα score of 73.4%. Among other HMMs that do not
utilize alignment information, it appears that our Qα represents a 3.5% improvement over the previous
best [23], while our SOVα is comparable (0.2% better). However, due to differences in the data set, we
emphasize the novelty of the approach rather than the exact magnitude of the improvements. We are
extending our technique to beta strands (and associated data sets) as ongoing work.



2 Related Work 

King and Sternberg share our goal of identifying a small and intuitive set of parameters in the design
of the DSC predictor [16]. DSC is largely based on the classic GOR technique [H], which tabulates
(during training) the frequency with which each residue appears at a given offset (-8 to +8) from a
given structure element (helix, strand, coil). During prediction, each residue is assigned the structure
that is most likely given the recorded frequencies for the surrounding residues. King and Sternberg
augment the GOR algorithm with several parameters, including the distance to the end of the chain
and local patterns of hydrophobicity. They use linear discrimination to derive a statistically favorable
weighting of the parameters, resulting in a simple linear cost function; they also perform homologous
sequence alignment and minor smoothing and filtering. Using about 1,000 parameters, they estimate an
accuracy of Qα = 73.5% for DSC. The primary difference between our predictor and DSC is that we
achieve comparable accuracy (our Qα = 77.6%) without providing alignment information. Incorporating
an alignment profile is often responsible for a 5-7% improvement in accuracy [19] [32] [30]. In addition, we
learn the position-specific residue affinities rather than using the GOR frequency count. We also consider
multiple predictions simultaneously and maintain a global context rather than predicting each residue
independently.

Many researchers have developed Hidden Markov Models (HMMs) for secondary structure prediction. 
Once it has been trained, our predictor could be converted to an HMM without losing any predictive 
power, as our dynamic programming procedure parallels the Viterbi algorithm for reconstructing the most 
likely hidden states. However, for the training phase, our system represents a soft-margin Hidden Markov 
SVM [1] rather than a traditional HMM. Unlike an HMM, a Hidden Markov SVM has a discriminative
learning procedure based on a maximum margin criterion and can incorporate "overlapping features", 
driving the learning based on the overall predicted structure rather than via local propagation. 

Tsochantaridis, Altun and Hofmann apply an integrated HMM and SVM framework for secondary
structure prediction [37]. The technique may be similar to ours, as we are reusing their SVM
implementation; unfortunately, there are few details published. Nguyen and Rajapakse also present a hybrid
scheme in which the output of a Bayesian predictor is further refined by an SVM classifier |2^ . The Qα
score is 74.1% for the Bayesian predictor alone and 77.0% for the Bayesian/SVM hybrid; the SOVα score
is 73.2% for the Bayesian predictor and a comparable 73.0% for the Bayesian/SVM hybrid. To the best
of our knowledge, these are the highest Qα and SOVα scores to date (as tested on Rost and Sander's
data set [32]) for a method that does not utilize alignment information.

Bystroff, Thorsson, and Baker design an HMM to recognize specific structural motifs and assemble
them into protein secondary structure predictions [3]. Using alignment profiles, they report an overall
Q3 value of 74.3%. Our approach may use fewer parameters, as they manually encode each target
motif into a separate set of states. Martin, Gibrat, and Rodolphe develop a 21-state HMM
with 471 parameters that achieves an overall Q3 value of 65.3% (without alignment profiles) and 72%
(with alignment profiles) [21]. Alpha helices are identified based on an amphiphilic motif: a succession
of two polar residues and two non-polar residues. Won, Hamelryck, Prügel-Bennett and Krogh give a
genetic algorithm that automatically evolves an HMM for secondary structure prediction |40[ E] . Using
alignment profiles, they report an overall Q3 value of 75% (only 69.4% for helices). They claim that
the resulting 41-state HMM is better than any previous hand-designed HMM. While they restrict their
HMM building blocks to "biologically meaningful primitives", it is unclear if there is a natural energetic
interpretation of the final HMM.

Schmidler, Liu, and Brutlag develop a segmental semi-Markov Model (a generalization of the HMM),
allowing each hidden state to produce a variable-length sequence of the observations [35] [36]. They report
a Q3 value of 68.8% without using alignment profiles. Chu and Ghahramani push further in the same
direction, merging with the structure of a neural network and demonstrating modest (~1%) improvements
over Schmidler et al. [7].

Category                       Predictor                  Number of Parameters
Neural Net                     PHD [32]                   > 10,000
Neural Net                     SSPro [2]                  1400-2900
Neural Net                     Riis & Krogh [30]          311-600
GOR + Linear Discrimination    DSC [16]                   1000
HMM                            Martin et al. [21]         471
HM-SVM                         this paper (alpha only)    302

While our technique is currently limited to an alpha helix predictor, for this task it performs better 
(Qα = 77.6%) than any of the HMM-based methods described above; furthermore, it does so without
any alignment information. Our technique is fundamentally different in its use of Hidden Markov SVMs 
for the learning stage. Lastly, some groups have applied HMM-based predictors to the specific case of 
transmembrane proteins, where much higher accuracy can be obtained at the expense of generality [ISj . 

There has been a rich and highly successful body of work applying neural networks to secondary
structure prediction. The efforts date back to Qian and Sejnowski, who design a simple feed-forward network
for the problem [29]. Rost and Sander pioneered the automatic use of multiple sequence alignments
to improve the accuracy as part of their PHD predictor [32], which was the top performer at CASP2.
More recently, Jones employed the PSI-BLAST tool to efficiently perform the alignments, boosting his
PSIPred predictor [15] to the top of CASP3. Baldi and colleagues employ bidirectional recurrent networks
in SSPro [2], a system that provided the foundation for Pollastri and McLysaght's Porter server [28].
Petersen describes a balloting system containing as many as 800 neural networks; while an ensemble of
predictors is commonly used to gather more information, this effort is distinguished by its size [27]. A
neural network has been followed by an HMM, resulting in a simple and fast system [20]; neural networks
have also been used as a post-processing step for GOR predictors [25].

The PSIPred predictor [15] is among the highest scoring neural network techniques. While it achieves 
an overall Q3 of about 77% and an SOV of 74%, its performance for alpha helices is even higher: for 
recent targets on EVA, an open and automatic testing platform, PSIPred offers an SOVα of 78.6%
(EVA does not publish a Qα value comparable to ours).

Though state-of-the-art neural network predictors such as PSIPred currently out-perform our method 
by about 5%, they incorporate multiple sequence alignments and are often impervious to analysis and 
understanding. In particular, the number of parameters in a neural network can be an order of magnitude 
higher than that of an HMM-based approach (see Table [2|) . A notable exception is the network of Riis 
and Krogh, which is structured by hand to reduce the parameter count to as low as 311 (prediction 
accuracy is reported at Q3 = 71.3% with alignment profiles, a good number for its time). 

Recently, Support Vector Machines (SVMs) have also been used as a standalone tool for secondary 
structure prediction [211 [391 13 IH [El HI]. In contrast to our technique, which uses an SVM only for
learning the parameters of an HMM, these methods apply an SVM directly to a window of residues 
and classify the central residue into a given secondary structure class. The number of parameters in 
these techniques depends on the number of support vectors; in one instance, the support vectors occupy 
680MB of memory [39]. Regardless of the number of parameters, it can be difficult to obtain a biological 
intuition for an SVM, given the non-linear kernel functions and numerous support vectors. Nonetheless, 
these techniques appear to have significant promise, as Nguyen and Rajapakse report an overall Q3 of 
79.5% and an SOV of 76.3% on the PSIPred database [24].



3 Algorithm 

3.1 Formulation as an Optimization Problem 

According to thermodynamics, a folded protein is in a state of minimum free energy (except when kinetic
reasons get the protein stuck in a local minimum). We therefore approach the protein structure problem
as an optimization problem. We want to find a free-energy function $G(x, y)$, which is a function of $x$, the
protein's amino-acid sequence, and $y$, the protein's secondary structure. To predict a protein's structure
$\hat{y}$, we perform the following minimization:

$$\hat{y} = \operatorname*{argmin}_{y \in \mathcal{Y}} G(x, y) \tag{1}$$

To go from this general statement to a working algorithm, we need to find a free-energy function $G$ and
a set of structures $\mathcal{Y}$ for which the minimization shown in equation (1) is easy to compute. In choosing
$G$ and $\mathcal{Y}$, we trade off the ability to efficiently minimize $G$ against the ability to accurately capture the
richness and detailed physics of protein structure. Atomistic models are able to capture the whole range
of structures, and incorporate all the physical interactions between atoms. However, they can only be
optimized using heuristic methods. We therefore prefer to consider a simplified set of structures $\mathcal{Y}$, and
a cost function $G$ with lumped parameters that try to approach the physical reality.

These lumped parameters are difficult to determine experimentally. We will therefore define a class $\mathcal{G}$
of candidate free-energy functions that are easy to optimize over some set of structures $\mathcal{Y}$. Then we will
use machine learning techniques to pick a good $G$ from all the candidates in $\mathcal{G}$. The machine learning
will use structure information from the Protein Data Bank [26] to determine which $G$ to pick. Given a
set of training examples $\{(x_i, y_i) : i = 1, \ldots, k\}$, the learning algorithm needs to find a $G \in \mathcal{G}$ such that:

$$\forall i : y_i = \operatorname*{argmin}_{y \in \mathcal{Y}} G(x_i, y) \tag{2}$$

In practice, this $G$ may not exist or may not be unique, so the machine learning algorithm may have to
pick a good approximation, or select a $G$ that is more likely to generalize well to proteins not in the
training set. We will now look more closely at how a good $G$ is selected, and later, in Section 3.5, we will
be more specific about what $\mathcal{G}$ and $\mathcal{Y}$ are.

3.2 Iterative Constraint Based Approach 

First, we notice that equation (2) can be rewritten as the problem of finding a function $G$ that satisfies
the large set of inequality constraints

$$\forall i, \forall y \in \mathcal{Y} \setminus \{y_i\} : G(x_i, y_i) < G(x_i, y). \tag{3}$$

Unfortunately, the set of all secondary structures $\mathcal{Y}$ is exponentially large, so finding a $G \in \mathcal{G}$ that
satisfies all these inequalities directly is computationally intractable. Our approach reduces the problem
by ignoring as many constraints as possible, only considering the constraints it is "forced" to consider.

In our method the reduced problem is defined as the problem of finding a function $G'$ that satisfies
the set of constraints

$$\forall i, \forall y \in S_i : G'(x_i, y_i) < G'(x_i, y), \tag{4}$$

for some $S_i \subseteq \mathcal{Y} \setminus \{y_i\}$.

Initially, we begin with no constraints at all, that is, $S_i = \emptyset$ for all $i$, and we choose some function
$G' \in \mathcal{G}$. Because we start with no constraints, any function $G' \in \mathcal{G}$ satisfies equation (4).
We need to check whether $G'$ approximates the solution $G$ of equation (2). In particular, we verify
whether $G'$ can be used to approximate $y_1$ as the solution $\hat{y}_1$ of the optimization problem

$$\hat{y}_1 = \operatorname*{argmin}_{y \in \mathcal{Y}} G'(x_1, y).$$

If $G'(x_1, y_1) < G'(x_1, \hat{y}_1) + \varepsilon$, we say that $\hat{y}_1$ is "close" to $y_1$ in the sense that $\hat{y}_1$ is a close enough
approximation of $y_1$. If $\hat{y}_1$ is close to $y_1$, we go on to the next optimization problem,

$$\hat{y}_2 = \operatorname*{argmin}_{y \in \mathcal{Y}} G'(x_2, y).$$

If $\hat{y}_1$ is not close to $y_1$, this means the constraint $G'(x_1, y_1) < G'(x_1, \hat{y}_1)$ in equation (3) has been violated.
Therefore we must add this constraint to our reduced problem; we replace $S_1$ by $S_1 \cup \{\hat{y}_1\}$. In order to
solve the new reduced problem we need to find a new $G'$ that satisfies the old and new constraints. At all
times the number of constraints in the reduced problem is relatively small, so that it is computationally
feasible to find its solution.

Whenever a prediction $\hat{y}_i$ is not satisfactorily close to $y_i$, we add more constraints. For instance,
Figure 1 shows our problem reduction for the training example $(x_1, y_1)$. Writing
$\mathcal{Y} = \{y^1, y^2, \ldots, y^m\}$, the reduced problems lead to constraints of the form
$G'(x_1, y_1) < G'(x_1, y^j)$ for particular structures $y^j$ such as $y^{245}$; in other words, $S_1$ is the set
of those structures $y^j$.

The algorithm terminates if no constraints need to be added. That is, each prediction is a good
approximation,

$$\forall i : G'(x_i, y_i) < G'(x_i, \hat{y}_i) + \varepsilon \quad \text{where } \hat{y}_i = \operatorname*{argmin}_{y \in \mathcal{Y}} G'(x_i, y). \tag{5}$$

This is equivalent to

$$\forall i, \forall y \in \mathcal{Y} \setminus \{y_i\} : G'(x_i, y_i) < G'(x_i, y) + \varepsilon. \tag{6}$$

This shows to what extent the function $G'$ satisfies the full set of constraints in equation (3).
Figure 1: Summary of the learning method. In this figure each large frame represents a problem that
needs to be solved. On the left, we start with an intractably large problem. At each iteration, we pick a 
subset of the large problem to work on, solve it approximately using an SVM formulation, and use the 
resulting solution to expand the subset of constraints we are working with. 



3.3 Linear Cost Function 

One important assumption we make is that the family of free energy functions $\mathcal{G}$ is linear. That is, the
total free energy of the protein is a sum of elementary interactions. This simplification agrees with many
mathematical models of the energy force fields that control protein folding. For example, electrostatic,
van der Waals, stretch, bend, and torsion forces are all described by the sum of energy terms for each
pair of molecular elements. Given this, we can formally define the family of functions $\mathcal{G}$ to be

$$\mathcal{G} = \{ G_w : (x, y) \mapsto \langle w, \Psi(x, y) \rangle : \text{for some } w \}. \tag{7}$$

Here the feature function $\Psi$ is fixed and known, representing some specific energy characteristic that
we are interested in. Because $G_w$ is linear, the dot product with the vector $w$ (denoted by
$\langle \cdot, \cdot \rangle$) weights the importance of the individual terms within $\Psi$. With this
assumption, the reduced problem's constraints given by equation (4) can be rewritten as

$$\forall i, \forall y \in S_i : G_w(x_i, y_i) < G_w(x_i, y). \tag{8}$$

In order to solve the reduced problem, we need to find the unknown weight vector $w$ such that these
constraints are satisfied. Again, since $G_w$ is a linear function, this set of constraints translates into

$$\forall i, \forall y \in S_i : \langle w, \Delta\Psi_i(y) \rangle > 0, \tag{9}$$

where $\Delta\Psi_i(y) = \Psi(x_i, y) - \Psi(x_i, y_i)$. This reformulation of the constraints allows this problem to
be solved in a much more elegant and computationally efficient manner. In our method we use the
powerful technique of support vector machines to quickly determine the function $G_w$, although many
other techniques are possible.
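To make the linear form concrete, the following minimal sketch evaluates $G_w$ as a dot product and checks the margin-free constraint of equation (9) for one competing structure. The feature map `psi` (counts of residue/label pairs) and the weights are illustrative assumptions, not the features used in this paper.

```python
from collections import Counter

def psi(x, y):
    """Toy feature map (an assumption for illustration): counts of (residue, label) pairs."""
    return Counter(zip(x, y))

def dot(w, features):
    """Sparse dot product <w, features>; missing weights count as zero."""
    return sum(w.get(f, 0.0) * c for f, c in features.items())

def energy(w, x, y):
    """G_w(x, y) = <w, Psi(x, y)>, as in equation (7)."""
    return dot(w, psi(x, y))

def delta_psi(x, y_true, y):
    """Delta Psi_i(y) = Psi(x_i, y) - Psi(x_i, y_i), stored sparsely."""
    d = Counter(psi(x, y))
    d.subtract(psi(x, y_true))
    return {f: c for f, c in d.items() if c != 0}

def constraint_satisfied(w, x, y_true, y):
    """Equation (9): <w, Delta Psi_i(y)> > 0, i.e. G_w(x, y_true) < G_w(x, y)."""
    return dot(w, delta_psi(x, y_true, y)) > 0
```

Because the cost is linear, checking a constraint only requires the feature difference between the two structures, which is what makes the constraints of equation (9) linear in $w$.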

3.4 Iteratively Constraining Support Vector Machines 

Support Vector Machines (SVMs) are a fast and effective tool for generating functions from a set of
labeled input training data. SVMs are able to determine a set of weights $w$ for the function $G_w$ that will
allow $G_w$ to accurately map all of the training example inputs $x_i$ to outputs $y_i$. They do this by solving
the dual of the minimization problem

$$\hat{w} = \operatorname*{argmin}_{w} \min_{\xi} \; \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \tag{10a}$$

under the constraints

$$\forall i, \forall y \in S_i : \langle w, \Delta\Psi_i(y) \rangle \ge 1 - \xi_i \quad \text{with } \forall i : \xi_i \ge 0. \tag{10b}$$

We can therefore use SVMs to determine our function $G_w$; however, this only solves half of our
problem. Given a candidate $G_w$ we must then determine if equation (3) has been violated and add more
constraints to the reduced problem if necessary. To accomplish this task, we build on work by
Tsochantaridis et al. [38], which tightly couples this constraint verification problem with the SVM $w$
minimization problem.

First, a loss function $\Delta(y_i, y)$ is defined that weighs the goodness of the structures $y$. Adding this to
the SVM constraints in equation (10b) gives

$$\forall i, \forall y \in S_i : \xi_i \ge \Delta(y_i, y) - \langle w, \Delta\Psi_i(y) \rangle \tag{11}$$

Using this we can decide when to add constraints to our reduced problem and which constraints to
add. Since at every iteration of the algorithm we determine some $w$ for the current $S_i$, we can then find
the smallest possible SVM "slack variable" values for $\xi_i$ in equation (10a). This minimum $\xi_i$ will be

$$\xi_i = \max\Bigl(0, \max_{y \in S_i} \Delta(y_i, y) - \langle w, \Delta\Psi_i(y) \rangle\Bigr) \tag{12}$$

This minimum $\xi_i$, which was determined using $S_i$, can be compared to a similar $\xi'_i$ that is obtained
by instead maximizing over $\mathcal{Y} \setminus \{y_i\}$ in equation (12). This will tell us how much the constraints we are
ignoring from $\mathcal{Y} \setminus \{y_i\}$ will change the solution. The constraint that is most likely to change the solution
is that which would have caused the greatest change to the slack variables. Therefore we would add the
constraint to $S_i$ that corresponds to

$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} \Delta(y_i, y) - \langle w, \Delta\Psi_i(y) \rangle. \tag{13}$$

Tsochantaridis et al. [38] show that by only adding constraints when $\hat{y}$ would change $\xi_i$ by more than
$\varepsilon$, one can attain a provable termination condition for the problem. The summary of this overall process
can be seen in Algorithm 1.



Input: (x_1, y_1), ..., (x_n, y_n), C, ε
S_i ← ∅ for all 1 ≤ i ≤ n
repeat
    for i = 1, ..., n do
        Set up the cost function H(y) = Δ(y_i, y) − ⟨w, ΔΨ_i(y)⟩
        Compute ŷ = argmax_{y ∈ Y} H(y)
        Compute ξ_i = max{0, max_{y ∈ S_i} H(y)}
        if H(ŷ) > ξ_i + ε then
            S_i ← S_i ∪ {ŷ}
            w ← optimize over S = ∪_i S_i
until no S_i has changed during iteration

Algorithm 1: Algorithm for iterative constraint based optimization.
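The loop of Algorithm 1 can be sketched on a toy problem as follows. This is only an illustration under loud assumptions: the set of structures is enumerated by brute force (feasible only for very short sequences), the feature map `psi` is a made-up per-residue helix count, the loss is the Hamming distance, and the SVM quadratic program is replaced by a plain subgradient descent on the objective of equation (10a) restricted to the working sets. None of these stand-ins is the implementation used for the results in this paper.

```python
import itertools

def psi(x, y):
    """Hypothetical features: number of times each residue type is labelled helix ('h')."""
    feats = {}
    for r, s in zip(x, y):
        if s == "h":
            feats[r] = feats.get(r, 0) + 1
    return feats

def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def delta_psi(x, y_true, y):
    d = dict(psi(x, y))
    for k, v in psi(x, y_true).items():
        d[k] = d.get(k, 0) - v
    return d

def hamming(y_true, y):
    return sum(a != b for a, b in zip(y_true, y))

def H(w, x, y_true, y):
    """H(y) = Delta(y_i, y) - <w, Delta Psi_i(y)>."""
    return hamming(y_true, y) - dot(w, delta_psi(x, y_true, y))

def all_structures(n):
    """Brute-force enumeration of every helix/coil string of length n."""
    return ("".join(p) for p in itertools.product("hc", repeat=n))

def optimize(data, S, C, steps=200):
    """Stand-in for the SVM solver: subgradient descent with step size 1/t."""
    n, w = len(data), {}
    for t in range(1, steps + 1):
        grad = dict(w)  # gradient of 0.5 * ||w||^2
        for (x, y_true), S_i in zip(data, S):
            if not S_i:
                continue
            y_star = max(sorted(S_i), key=lambda y: H(w, x, y_true, y))
            if H(w, x, y_true, y_star) > 0:  # active slack term
                for k, v in delta_psi(x, y_true, y_star).items():
                    grad[k] = grad.get(k, 0.0) - (C / n) * v
        w = {k: w.get(k, 0.0) - grad[k] / t for k in grad}
    return w

def train(data, C=1.0, eps=0.1):
    S, w, changed = [set() for _ in data], {}, True
    while changed:  # "repeat ... until no S_i has changed"
        changed = False
        for i, (x, y_true) in enumerate(data):
            y_hat = max(all_structures(len(x)), key=lambda y: H(w, x, y_true, y))
            xi = max([0.0] + [H(w, x, y_true, y) for y in S[i]])
            if H(w, x, y_true, y_hat) > xi + eps:
                S[i].add(y_hat)
                w = optimize(data, S, C)
                changed = True
    return w

def predict(w, x):
    """Equation (1) by enumeration: the structure of minimum cost."""
    return min(all_structures(len(x)), key=lambda y: dot(w, psi(x, y)))
```

For example, `train([("AAGG", "hhcc"), ("GAAG", "chhc")])` yields a negative weight for `A` and a positive weight for `G`, so that `predict` places `A` residues in helices. In a realistic setting both the `max` over all structures and the `min` in `predict` would be computed by dynamic programming rather than by enumeration.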

3.5 Defining the Set of Valid Structures

One final issue remains to be solved to complete our algorithm. We need to specify what $\mathcal{Y}$ is, and how
to optimize $G(x, y)$ over $\mathcal{Y}$. Indeed, in general $\mathcal{Y}$ can be exponentially large with respect to the sequence
length, making brute-force optimization impractical. Our general approach will be to structure $\mathcal{Y}$ and
$G(x, y)$ in a way that will allow optimization through dynamic programming.

Most secondary-structure prediction tools use local features to predict which regions of a protein 
will be helical [STj. Individual residues can have propensities for being in a helix, they can act as helix 
nucleation sites, or they can interact with other nearby residues. This type of information can be well 
captured by Hidden Markov Models (HMMs). Equivalently, we choose to capture them using Finite State 
Machines (FSMs). The only difference between the FSMs we use and a non-stationary HMM is that the 
HMM deals with probabilities, which are multiplicative, while our FSMs deal with pseudo-energies, which 
are additive. Up to a logarithm, they are the same.
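The correspondence between multiplicative probabilities and additive pseudo-energies can be illustrated with a few lines of code. The sketch below assumes the usual convention that an energy is the negative logarithm of a probability; the numbers are arbitrary.

```python
import math

def path_probability(probs):
    """HMM view: probabilities of successive transitions multiply."""
    p = 1.0
    for q in probs:
        p *= q
    return p

def path_energy(probs):
    """FSM view: pseudo-energies (-log probability) of successive transitions add."""
    return sum(-math.log(q) for q in probs)

path_a = [0.9, 0.5, 0.8]
path_b = [0.6, 0.7, 0.9]
# The more probable path is exactly the path of lower total energy.
more_probable = "a" if path_probability(path_a) > path_probability(path_b) else "b"
lower_energy = "a" if path_energy(path_a) < path_energy(path_b) else "b"
```

Here `more_probable` and `lower_energy` always agree, since the logarithm is monotonic.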

We define $\mathcal{Y}$ to be the language that is recognized by some FSM. Thus a structure $y \in \mathcal{Y}$ will be a
string over the input alphabet of the FSM. For example, that alphabet could be $\{h, c\}$, where $h$ indicates
that the residue at that position in the string is in a helix, and $c$ indicates that it is in a coil region. A
string $y$ is read by an FSM one character at a time, inducing a specific sequence of transitions between internal
states. Note that the FSMs we are considering do not need to be deterministic. However, they do need to
satisfy the property that, for a given input string, there is at most one sequence of transitions leading from the
initial state to a final state. We denote this sequence of transitions by $\sigma(y)$ and note that $\sigma(y)$ need not
be defined for all $y$.

To define $G(x, y)$, we create the cost function $\psi(x, t, i)$, which assigns a vector of feature values
whenever a transition $t$ is taken at position $i$ in the sequence $x$. These feature values determine the total
cost $G(x, y)$ by

$$G(x, y) = \begin{cases} +\infty & \text{if } |x| \ne |y| \text{ or } \sigma(y) \text{ is undefined} \\ \bigl\langle w, \sum_i \psi(x, \sigma(y)_i, i) \bigr\rangle & \text{otherwise} \end{cases} \tag{14}$$

This cost is easy to optimize over $y$ by using the Viterbi algorithm [33]. This algorithm proceeds in
$|x|$ rounds. In round $i$, the best path of length $i$ starting from an initial state is calculated for each FSM
state. These paths are computed by extending the best paths from the previous round by one transition,
and picking the best resulting path for each FSM state. The complexity of the algorithm is $O(|\mathrm{FSM}| \cdot |x|)$,
where $|\mathrm{FSM}|$ is the number of states and transitions in the FSM.
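The rounds described above can be sketched as follows. The three-state machine used here is a hypothetical example (helices must span at least two residues), not the FSM used in our experiments, and the per-residue helix energies are made up. For brevity each state stores its best path as a string, which costs more than the back-pointers a real implementation would use.

```python
import math

# Hypothetical FSM: (source state, target state, emitted label).
TRANSITIONS = [
    ("C", "C", "c"),    # stay in coil
    ("C", "H1", "h"),   # first residue of a helix
    ("H1", "H2", "h"),  # second residue of a helix
    ("H2", "H2", "h"),  # helix continues
    ("H2", "C", "c"),   # helix ends
]
INITIAL = {"C"}
ACCEPT = {"C", "H2"}

def viterbi(x, cost):
    """Return (minimum total cost, best label string) among accepted paths of length |x|."""
    best = {s: (0.0, "") for s in INITIAL}
    for i in range(len(x)):  # one round per residue
        nxt = {}
        for src, dst, label in TRANSITIONS:
            if src not in best:
                continue
            c = best[src][0] + cost(x, (src, dst, label), i)
            if dst not in nxt or c < nxt[dst][0]:
                nxt[dst] = (c, best[src][1] + label)
        best = nxt
    finals = [best[s] for s in ACCEPT if s in best]
    return min(finals) if finals else (math.inf, None)

# Made-up pseudo-energies: negative values favour a helix, coil costs zero.
HELIX_ENERGY = {"A": -1.0, "G": 2.0}

def toy_cost(x, transition, i):
    return HELIX_ENERGY[x[i]] if transition[2] == "h" else 0.0
```

With these assumptions, `viterbi("GAAG", toy_cost)` selects the structure `chhc`, while `viterbi("GAGG", toy_cost)` predicts all coil, because a single favourable residue cannot pay for the minimum helix length imposed by the machine.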



4 Results 



We now present results from our implementation of our algorithm. It was written in Objective Caml,
and uses SVM-struct/SVM-light [14] by Thorsten Joachims.



4.1 Finite State Machine Definition 

In our experimentation, we have used an extremely simple finite state machine that is presented in
Figure 2. Each state corresponds to being in a helix or coil region, and indicates how far into the region
we are. States H4 and C3 correspond to helices and coils more than 4 and 3 residues long, respectively.
Short coils are permitted, but helices shorter than 4 residues are not allowed, as the dataset we used did
not contain any helices less than 4 residues long.

The features that were used in our experiments are presented in Table 1. The exact way in which
they are associated with transitions in the FSM is indicated in Table 2.



Figure 2: The finite state machine we used. Double circles represent accept states. The arrow leading
into state C3 indicates that it is an initial state. Each transition is labeled with the type of structure it
corresponds to: helix (H) or coil (C), and a feature label (#i) indicating which features correspond to this
transition in Table 2.



Name     Number of features    Comment
A        1                     Penalty for very short coil
B        1                     Penalty for short coil
H_R      20                    Energy of residue R in a helix
C_R^i    140                   Energy of residue R at position i relative to C-cap
N_R^i    140                   Energy of residue R at position i relative to N-cap
Total    302

Table 1: Summary of features that are considered.
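As a quick consistency check on the total in Table 1, and assuming that each group of 140 cap features corresponds to the 20 amino acids at 7 positions around the cap (an interpretation that the table does not state explicitly), the count works out as:

```latex
\[
\underbrace{1}_{A} + \underbrace{1}_{B} + \underbrace{20}_{H_R}
+ \underbrace{20 \times 7}_{C_R^i} + \underbrace{20 \times 7}_{N_R^i} = 302
\]
```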



Label    Features                                    Comment
#0                                                   Coil defined as zero-energy
#1       …                                           End of helix processing (C-cap)
#2       H_{R_n} + Σ_{i=-3}^{…} N^i_{R_…}            Start of helix processing (N-cap)
#3       H_{R_n}                                     Normal helix residue
#4       H_{R_n} + A                                 Helix after very short coil
#5       H_{R_n} + B                                 Helix after short coil

Table 2: Features that arise from each transition in the FSM. R_i denotes the residue at position i in the
protein, and n is the position at which we are in the protein.



We have experimented with various loss functions Δ (see Section 3.4). We have tried a 0-1 loss
function (1 unless both structures are identical), Hamming distance (number of incorrectly predicted
residues), and a modified Hamming distance (residues are given more weight when they are farther from
the helix-coil transitions). Each one gives results slightly better than the previous one.
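The three kinds of loss can be sketched as below. The 0-1 and Hamming losses follow their standard definitions; the weighting used in `weighted_hamming_loss` (one plus the distance to the nearest helix-coil transition, capped at a hypothetical `cap`) is only one plausible choice, since the exact weights are not specified here.

```python
def zero_one_loss(y_true, y):
    """0 if the two structures are identical, 1 otherwise."""
    return 0 if y_true == y else 1

def hamming_loss(y_true, y):
    """Number of residues whose predicted state differs from the true state."""
    return sum(a != b for a, b in zip(y_true, y))

def boundary_distance(y_true, i):
    """Residues between position i and the nearest helix-coil transition (0 if adjacent)."""
    boundaries = [b for b in range(len(y_true) - 1) if y_true[b] != y_true[b + 1]]
    if not boundaries:
        return len(y_true)
    return min(i - b - 1 if i > b else b - i for b in boundaries)

def weighted_hamming_loss(y_true, y, cap=3):
    """Hamming loss in which errors far from a transition weigh more (assumed weighting)."""
    return sum(
        min(cap, 1 + boundary_distance(y_true, i))
        for i, (a, b) in enumerate(zip(y_true, y))
        if a != b
    )
```

With the true structure `cchhhhcc`, shifting a helix boundary by one residue (`ccchhhcc`) and breaking the helix in its interior (`cchhchcc`) both have Hamming loss 1, but the weighted variant charges the interior error more.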

None of the features we have used involve more than one residue in the sequence. We have done some
experimentation with more complicated cost functions that include pairwise interactions between nearby
residues in a helix, namely between n and n+3 or n and n+4. So far we have not managed to improve our
prediction accuracy using these interactions, possibly because each pairwise interaction adds 400 features
to the cost function, leaving much room for over-learning. Indeed, with the expanded cost functions we
observed improved predictions on the training proteins, but decreased performance on the test proteins.



4.2 Results 

We have been working with a set of 300 non-homologous all-alpha proteins taken from EVA's largest 
sequence-unique subset of the PDB [8] at the end of July 2005. The sequences and structures have been
extracted from PDB data processed by DSSP. Only alpha helices have been considered (H residues in 
DSSP files); everything else has been lumped as coil regions. 

In our experimentation, we have been splitting our 300 proteins into two 150 protein subsets. The 
first set is used to train the cost function; the second set is used to evaluate the cost function once it has 
been learned. Since the results vary a bit depending on how the proteins are split in two sets, we have 
trained the cost function on 20 random partitions into training and test sets, and taken averages. 
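The evaluation protocol described above can be sketched as follows. The callback `train_and_score` is a hypothetical placeholder for training on one half and scoring on the other, and the seed is an arbitrary choice made only so that the sketch is reproducible.

```python
import random
import statistics

def random_splits(items, n_splits=20, seed=0):
    """Yield n_splits random partitions of items into two equal halves."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        yield shuffled[:half], shuffled[half:]

def evaluate_over_splits(items, train_and_score, n_splits=20, seed=0):
    """Average and standard deviation of a score over random train/test partitions."""
    scores = [train_and_score(train, test)
              for train, test in random_splits(items, n_splits, seed)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Averaging over several partitions reduces the dependence of the reported accuracy on any single split, and the standard deviation gives a sense of that dependence.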

We present results using both the Qα and SOVα metrics. The Qα metric is simply the number of
correctly predicted residues divided by sequence length. SOVα is a more elaborate metric that has been
designed to ignore small errors in helix-coil transition position, but heavily penalize more fundamental
errors such as gaps appearing in a helix [42].
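Reading Qα as the percentage of residues whose helix/coil state is predicted correctly (which is consistent with the accuracies reported in this paper), it can be computed with a few lines. The function name and the string encoding of structures (`h` for helix, `c` for coil) are illustrative assumptions; SOVα is more involved and is not sketched here.

```python
def q_alpha(y_true, y_pred):
    """Percentage of residues whose helix ('h') or coil ('c') state is predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("structures must have the same length")
    if not y_true:
        raise ValueError("empty structure")
    correct = sum(a == b for a, b in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)
```

For instance, a prediction that misplaces the end of a helix by one residue in an 8-residue sequence gets 7 of 8 residues right.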



Description                      SOVα (%)    SOVα (%)    Qα (%)     Qα (%)    Training
                                 (train)     (test)      (train)    (test)    time (s)
Best run for SOVα                76.4        75.1        79.6       78.6      123
Average of 20 runs               75.1        73.4        79.1       77.6      162
Standard deviation of 20 runs    1.0         1.4         0.6        0.9       30

Table 3: Results of our predictor. We have provided an average case.
Figure 3: Histograms showing the distribution of (a) SOVα and (b) Qα across proteins in the test set. We have
shown the average case, and the best case which has the highest SOVα.
Our results have been obtained for a slack variable weighting factor C = 0.08 in equation (10a). The
algorithm termination criterion was for ε = 0.1. Both of these parameters have a big impact on prediction
accuracy and training time.

5 Conclusion 

In this paper we have presented a method to predict alpha helices in all-alpha proteins. The HMM is
trained using a support vector machine method which iteratively picks a cost function based on a set of 
constraints, and uses the predictions resulting from this cost function to generate new constraints for the 
next iteration. 

On average, our method is able to predict alpha helices in all-alpha proteins with an accuracy of 73.4%
(SOVα) or 77.6% (Qα). Unfortunately, these results are difficult to compare with existing prediction methods,
which usually predict both alpha helices and beta strands. Rost and Sander caution that restricting
the test set to all-alpha proteins can result in up to a 3% gain in accuracy [32]. In addition, recent
techniques such as PSIPred [15] consider 3-10 helices (the DSSP state 'G') to be part of a helix rather
than loop, and report gains of about 2% in overall Q3 if helices are restricted to 4-helices (as in most
HMM techniques, including ours).

The real power of the machine learning method we use is its applicability beyond HMM models. 
Indeed, instead of describing a protein structure as a sequence of HMM states, we could equally describe 
it as a parse tree of a context-free grammar or multi-tape grammar. With these enriched descriptions, we 
should be able to include in the cost function interactions between adjacent strands of a beta-sheet. This 
should allow us to incorporate beta-sheet prediction into our algorithm. Unlike most secondary structure 
methods, we would then be able to predict not only which residues participate in a beta-sheet, but also 
which residues they are forming hydrogen bonds with in adjacent sheets. 
A Example learned weight vector

Tables 4 and 5 show the w vector that led to the best test SOV.



A    -86
B    -43

Table 4: Residue independent pseudo-energies.





Res   H_R  N^-3  N^-2  N^-1   N^0   N^1   N^2   N^3  C^-3  C^-2  C^-1   C^0   C^1   C^2   C^3
G   -1731   443    26   -73   250   150  -179  -319  -277     1   123   369  -833   215   187
A     764   534   484   800  -745  -628  -471  -528  -357  -386  -452  -499    41   580   336
V     997   512   603   727  -824  -794  -583  -311  -340  -588  -667  -879   -68   706   501
I    1683   611   540   858 -1364 -1202 -1001  -425  -388  -591  -815  -990   380   822   381
L    1440   756   879   989 -1143 -1057  -743  -394  -392  -447  -614  -826   450   948   669
F     734   653   559   686  -750  -592  -551  -332  -283  -478  -718  -601    30   581   433
P   -4024   376  -110  -232  2325  1479   601  -178  -132   169   283       -2343  -607  -327
M     645   623   554   736  -930  -750  -300  -309  -349  -340  -450  -511   141   778   615
W     769   550   558   864  -551  -435  -356  -184  -255  -488  -647  -762  -265   -29   236
C   -1507    56  -253  -262    50  -204  -276  -292    21   308   296   482  -844   113  -195
S    -769   575   383   547    85    55  -125  -314  -451  -281  -167    35  -573   448   304
T     -14   706   689   968  -235   -56   -23  -205  -522  -679  -489  -434  -592   425   248
N    -917   498   235   463  -140  -194  -461  -454  -242  -114    65   231  -438   308   153
Q     556   656   445   849  -512  -533  -378  -372  -373  -399  -464  -706   -50   742   450
Y     495   435   335   457  -771  -581  -579  -448  -249  -462  -385  -433   -94   569   517
H    -664   322   106   324   -26   -68  -158  -324  -291    14   146   327  -269   473   270
D    -559   886   614   890   299   230   207    16  -499  -510  -498  -214  -725   183   208
E     296   637   522   747  -352  -186  -183  -292  -416  -379  -261  -344   -88   487   414
K      11   567   373   476  -522  -446  -414  -327  -226  -203  -164  -165   -91   548   441
R     329   323   367   642  -429  -476  -317  -297  -269  -435  -412  -369   -58   583   374

Table 5: Residue dependent pseudo-energies